Intro to Statistics

Variables, Samples, Population, Data

Bogdan G. Popescu

John Cabot University

Variables

How are two or more variables related?

  • An independent variable is thought to influence or cause variation in another variable
  • A dependent variable depends upon or is caused by variation in the independent variable

Examples:
Independent Variable → Dependent Variable

Education → Income

Associations

Two variables are associated if knowing the value of one of them will help to predict the value of the other.

Example:
Life Expectancy and Urbanization

Life Expectancy

  • Indicator of countries’ overall health, physical well-being
  • Of interest to many, including health researchers, economists, sociologists, anthropologists…

We will examine UN data on average life expectancy for 214 countries.

We want to know whether urbanization has a positive or negative relationship with life expectancy.

The Data

Life Expectancy vs Urbanization

  • A row for each country
  • Countries are subjects, cases, units, or elements in the data set
  • Two columns for life expectancy and urbanization
  • These are variables or characteristics varying among units

  • What explains variation in life expectancy?
  • What characteristics do countries with longer life expectancy have in common?
  • What characteristics do countries with shorter life expectancy have in common?
  • Normally, we would be examining the relationship between different types of variables and life expectancy:
    • income
    • urbanization
    • education

Correlates of Life Expectancy

Scatterplot 1: life expectancy by urbanization

Associations vs. Causal Relationships

  • If two variables are associated, knowing the value of one helps predict the value of the other
  • In this example, we would predict:
    • A middle-income country would have a longer life expectancy than a low-income country
    • A country with more urbanization would have longer life expectancy than one with less
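
One common way to quantify how strongly two numerical variables are linearly associated is the Pearson correlation coefficient. The sketch below is only an illustration: the five (urbanization, life expectancy) pairs are made-up values, not the UN data, and the function name `pearson_r` is our own.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products and sums of squared deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical values for five countries (NOT the UN data)
urbanization = [20.0, 35.0, 50.0, 65.0, 80.0]     # % of population in urban areas
life_expectancy = [55.0, 60.0, 66.0, 70.0, 78.0]  # years

r = pearson_r(urbanization, life_expectancy)
print(f"r = {r:.3f}")  # values near +1 indicate a strong positive association
```

A value of r near +1 or -1 means that knowing one variable helps a lot in predicting the other. By itself it says nothing about whether the relationship is causal.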

Causal Relationships

A causal relationship entails three elements:

  • The independent (X) and dependent variables (Y) covary
  • The change in X precedes the change in Y
  • The covariation between X and Y is not coincidental or spurious

Causal relationships can be stipulated in hypotheses.

Hypotheses

Relationships between variables can be stated in hypotheses.

A hypothesis is an explicit statement about the relationship between phenomena that formalizes the researcher’s informed guess.

Characteristics of Good Hypotheses

  • Empirical statements that formulate educated guesses
  • Logical reason to think data can confirm hypotheses
  • Indicate direction of the relationship
  • Terms must match testing methods
  • Data should be feasible to obtain
  • Must specify unit of analysis (individuals, organizations, states, etc.)

Examples of Hypotheses

People tend to adopt political viewpoints similar to their parents.
Democracies are more likely to engage in trade with one another.
Authoritarian regimes are more likely to violate human rights.
Countries where property rights are protected tend to have higher levels of development.

Concepts

Definitions of concepts should be:

  • clear
  • accurate
  • precise
  • informative

Concepts should strike a balance between the specific and the abstract.

Populations vs. Samples

Population – complete enumeration of some set of interest

To learn about the population, a sample is often studied

Sampling is the process of selecting a subset from the population

Sampling is used to estimate characteristics of the full population

Aim: Ensure sample is representative

Requirement: Know your population

Dominant approach: probability sampling

Populations vs. Samples

Representative sample – if the sampling were repeated many times, the samples’ features would match those of the population on average

Probability sampling reduces sample selection bias and ensures representativeness
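
The "on average" idea can be checked with a small simulation. The sketch below assumes a made-up population of 10,000 simulated values and uses simple random sampling, which is one form of probability sampling. The sample size, number of repetitions, and all variable names are illustrative choices.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: 10,000 simulated values (e.g., ages); not real data
population = [rng.gauss(40, 12) for _ in range(10_000)]
population_mean = statistics.fmean(population)

def sample_mean(pop, n, rng):
    """Mean of a simple random sample of size n, drawn without replacement."""
    return statistics.fmean(rng.sample(pop, n))

# Repeat the sampling many times and average the resulting sample means
sample_means = [sample_mean(population, 100, rng) for _ in range(1_000)]
average_of_sample_means = statistics.fmean(sample_means)

print(f"population mean:         {population_mean:.2f}")
print(f"average of sample means: {average_of_sample_means:.2f}")
```

Any single sample mean can miss the population mean, but across repeated samples the misses tend to cancel out.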

Data and Variables – Basics

Categorical

  • Binary: e.g., 0 = unemployed, 1 = employed
  • Nominal: Order does not matter (e.g., 0 = Green, 1 = Red, 2 = Blue)
  • Ordinal: Order is meaningful (e.g., 0 = Poor, 1 = Fair, 2 = Good)

Data and Variables – Basics

Numerical

  • Discrete: e.g., number of individuals in a household
  • Continuous: e.g., height, weight, wages
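
The variable type determines which summaries are meaningful. The sketch below uses hypothetical codings and made-up observations: the mode for a nominal variable, the median for an ordinal one, and the mean for a numerical one.

```python
import statistics

# Hypothetical codings, for illustration only
COLOR = {0: "Green", 1: "Red", 2: "Blue"}   # nominal: codes are just labels
HEALTH = {0: "Poor", 1: "Fair", 2: "Good"}  # ordinal: codes carry a rank

colors = [0, 1, 1, 2, 1]           # made-up nominal observations
health = [0, 2, 1, 2, 2]           # made-up ordinal observations
household_sizes = [1, 2, 2, 4, 5]  # made-up discrete numerical observations

# Nominal: only counting makes sense, so report the most frequent category
most_common_color = COLOR[statistics.mode(colors)]

# Ordinal: order is meaningful, so the median category is well defined
# (median_low always returns a value that actually occurs in the data)
median_health = HEALTH[statistics.median_low(health)]

# Numerical: differences and averages are meaningful
mean_household_size = statistics.fmean(household_sizes)

print(most_common_color, median_health, mean_household_size)
```

Averaging the nominal codes would run without error, but the result would depend entirely on the arbitrary numbering of the colors.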

Cross-Sectional Data

  • Cross-sectional datasets have one observation per unit
  • Data for one variable (attribute) measured in N countries is written as:

\[ \{X_1, X_2, X_3, \dots, X_N\} = \{X_i\}_{i=1,\dots,N} \]

  • Example values for one variable (e.g., life expectancy):

\[ \{45.38333, 68.28611, 57.53013, \dots, 77.04861\} = \{X_i\}_{i=1,\dots,N} \]

  • If we measure two attributes, we can represent them as a point in 2D space
  • A single data point is a vector in two dimensions

Example:
- Life expectancy = 59.75
- Level of urbanization = 66.4

Then the data point is:

\[ X = [66.4, 59.75] \]
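
A minimal sketch of how such a cross-section might be stored in code, with one record per unit. The first record reuses the data point above; the other two records, the country labels, and the `Country` type are hypothetical.

```python
from typing import NamedTuple

class Country(NamedTuple):
    name: str
    urbanization: float     # level of urbanization
    life_expectancy: float  # years

# One observation per unit (country)
data = [
    Country("A", 66.4, 59.75),  # the data point from the example above
    Country("B", 25.0, 52.10),  # hypothetical
    Country("C", 81.3, 77.05),  # hypothetical
]

N = len(data)

# One variable across all units: {X_i}, i = 1, ..., N
life_expectancies = [c.life_expectancy for c in data]

# One unit measured on two variables: a point in 2D space
point = (data[0].urbanization, data[0].life_expectancy)
print(N, life_expectancies, point)
```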

Evaluation of Empirical Propositions

Social scientists use statistical analyses to test theories by way of carefully thought-out hypotheses.

Hypotheses are falsifiable claims about the world.

Hypotheses connect dependent variables to independent variables.

- Dependent variables: outcomes or things we want to explain
- Independent variables: factors that help explain the dependent variable

Example

Hypothesis:
An increase in X (independent variable) leads to an increase in Y (dependent variable).

Democratization Hypothesis:
More economic development is associated with higher levels of democracy.

To test this, we collect data on X and Y.

Units of analysis are the entities where our theory applies (e.g., countries, individuals, firms).

Datasets

When we collect the data, we input it into a spreadsheet, a tabular format.

This becomes a dataset.

A Dataset

In this example, there appears to be a positive relationship between X and Y.

- Not all high-X observations have high Y
- Not all low-X observations have low Y

To evaluate the relationship, we fit a line that best approximates the pattern in the data.
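
A common choice for the "best" line is the least-squares line, which minimizes the sum of squared vertical distances between the points and the line. The slides do not say which method is used, so treat least squares as an assumption here. The data and the function name `fit_line` are also illustrative.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line y = a + b * x."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x must vary to fit a line")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical dataset with a positive but imperfect relationship
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 2.9, 4.2, 4.8, 6.0]

intercept, slope = fit_line(x, y)
print(f"y = {intercept:.2f} + {slope:.2f} * x")
```

A positive slope summarizes the overall pattern even though individual observations fall above or below the line.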

Cross-Sectional Data

Each country’s data is a point in a scatter plot.

If we measure three variables (e.g., life expectancy, urbanization, education), each country becomes a point in 3D space, and the data form a 3D point cloud.

Time-Series Data

  • A time series of length T is written as:

\[ \{X_1, X_2, X_3, \dots, X_T\} = \{X_t\}_{t=1,\dots,T} \]

  • A time series is a sequence of data points indexed in time order
  • It has a natural temporal ordering
  • Time is the second attribute

Time-Series Data vs. Cross-Section

Time-Series Data

A time series can be depicted as a 2D scatter plot.

Time is one variable, and the value of interest is another.

So, each point in the time series is a pair: (time, value).
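
A minimal sketch of a time series stored as (time, value) pairs. The yearly values are made up for illustration. Because the ordering is temporal, operations such as period-to-period changes only make sense after sorting by time.

```python
# Hypothetical yearly values for a single unit (not real data)
series = [(2002, 61.9), (2000, 61.2), (2003, 62.4), (2001, 61.5)]

# Put the observations in temporal order: X_1, X_2, ..., X_T
series.sort(key=lambda pair: pair[0])
T = len(series)

# Change from each period to the next, paired with the later time point
changes = [
    (t1, round(v1 - v0, 2))
    for (t0, v0), (t1, v1) in zip(series, series[1:])
]
print(T, changes)
```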

Time-Series and Cross-Section Data

Panel data combine a cross-section with time series: several units, each observed over multiple time periods.

Balanced Panel

Unbalanced Panel
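
In a balanced panel every unit is observed in every time period; in an unbalanced panel at least one unit-period observation is missing. The sketch below checks this for a list of (unit, period) pairs. The function name `is_balanced` and the example observations are hypothetical.

```python
def is_balanced(observations):
    """True if every unit appears in every period.

    observations: iterable of (unit, period) pairs, one per row of the panel.
    """
    observed = set(observations)
    units = {unit for unit, _ in observed}
    periods = {period for _, period in observed}
    # A balanced panel has exactly (number of units) x (number of periods) rows
    return len(observed) == len(units) * len(periods)

# Hypothetical panels: two countries, two years
balanced_panel = [("A", 2000), ("A", 2001), ("B", 2000), ("B", 2001)]
unbalanced_panel = [("A", 2000), ("A", 2001), ("B", 2000)]  # B is missing 2001

print(is_balanced(balanced_panel))    # True
print(is_balanced(unbalanced_panel))  # False
```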

Conclusion

  • Measurement quality depends on accuracy and precision
  • Reliability: can we replicate results?
  • Validity: does the measure reflect the concept?
  • Variables can be categorical or numerical
  • Data can be cross-sectional, time-series, or both (panel data)
  • Panel data can be balanced or unbalanced